Regularization Methods for High-Dimensional 
Instrumental Variables Regression With an Application 

<r>: to Genetical Genomics 

o: 

$_, . Wei Lin, Rui Feng, and Hongzhe Li 

<\ 

-^ . Abstract 

In genetical genomics studies, it is important to jointly analyze gene expression 

data and genetic variants in exploring their associations with complex traits, where 

<^5 ' the dimensionality of gene expressions and genetic variants can both be much larger 

than the sample size. Motivated by such modern applications, we consider the problem 

c/2 ' of variable selection and estimation in high-dimensional sparse instrumental variables 

models. To overcome the difficulty of high dimensionality and unknown optimal instru- 

£> , ments, we propose a two-stage regularization framework for identifying and estimating 

important covariate effects while selecting and estimating optimal instruments. The 

methodology extends the classical two-stage least squares estimator to high dimensions 

suiting procedure is efficiently implemented by coordinate descent optimization. In the 

en ■ 

representative case of L\ regularization, we establish estimation, prediction, and model 
J> ■ selection properties of the two-stage regularized estimators in the high-dimensional set- 

ting where the dimensions of covariates and instruments are both allowed to grow ex- 
ponentially with the sample size. The practical performance of the proposed method is 
evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse 
obesity data. 
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1 Introduction 

Genome-wide studies have been widely conducted to search tens of thousands of gene ex- 
pressions or hundreds of thousands of single nucleotide polymorphisms (SNPs) to detect 
associations with complex traits. By measuring and analyzing gene expressions and genetic 
variants on the same subjects, genetical genomics studies provide an integrative and powerful 
approach to addressing fundamental questions in genetics and genomics at the functional 
level. In these studies, gene expression levels are viewed as quantitative traits that are sub- 
ject to genetic analysis for identifying expression quantitative trait loci (eQTLs), in order to 
understand the genetic architecture of gene expression variation. The increasing availability 
of high-throughput genetical genomics data sets opens up the possibility of jointly analyz- 
ing gene expression data and genetic variants in exploring their associations with complex 
traits, with the goal to identify key genes and genetic markers that are potentially causal for 



complex human diseases such as obesity, heart disease, and cancer (lEmilsson et al.ll2008f ). 

Although in the past decade gene expression profiling has led to the discovery of many 
gene signatures that are highly predictive for clinical outcomes, the effort of using these find- 
ings to dissect the genetics of complex traits and diseases is often compromised by the critical 
issue of confounding. It is well known that many factors, such as unmeasured variables, ex- 
perimental conditions, and environmental perturbations, may exert pronounced influences 
on gene expression levels, which may in turn induce spurious asso ciations and/or distort 



true associations of gene expres sions with the response of interest (ILeek and Storey 



2007 



Fusi. Steele, and Lawrencdl2012l ). Moreover, due to the difficulty of high dimensionality, em- 
pirical studies are mostly based on marginal models, which are especially prone to variability 
caused by pleiotropic effects and dependence among genes. Ignoring these confounding issues 
tends to produce results that are both biologically less interpretable and less reproducible 
across independent studies. 

Instrumental variables (IV) methods provide a practical and promising approach to con- 
trol for confounding in genetical genomics studies, with genetic variants playing the role of 
instruments. This approach exploits the reasonable assumption that the genotype is assigned 
randomly, given the parents' genes, at meiosis and independently of possible confounding 
factors, and affects a clinical phenotype only indirectly through some intermediate pheno- 
types. In observational epidemiology, Mendelian randomization has been proposed as a class 
of methods for using genetic variants as instrume nts to assess the caus al effect of a modifiable 
phenotype or exposure on a disease outcome; see 



Lawlor et al. 



(120081 ) for a review. The pri- 



mary scenario considered in this context, however, involves only one exposure variable and 



requires the existence of a genetic variant whose relationship with the exposure has been 
well established. Thus, the methodology intended for Mendelian randomization is typically 
not applicable to genetical genomics studies, where the number of expression phenotypes 
is exceedingly large and the genetic architecture of each phenotype may be complex and 
unknown. 

IV models and methods have been extensively studied in the econometrics literature, 
wher e the problem is often cast in the framework of simultaneous equation models (IHeckman 
19781 ). It has been shown that classical IV estimators such as the two-stage least squares 
(2SLS) estimator and the limited information maximum likelihood (LIML) estimator are 
consistent only when the number of instruments increases s lowly with the sample size 



( IChao and Swanson 



2005 



Hansen. Hausman. and Neweyl 120081 ) . Recent developments have 



introduced regularization methods to mitigate the overfitting problem in high- dimensional 
feature space by exploiting the sparsit y of importan t covariates, thereby improving the perfor- 
mance of IV estimators substantially. ICanerl (120091 ) considered penalized generalized method 
of moments (GMM) with the bridge p enalty for variable sele c tion a nd estimation in the 
classical setting of fixed dimensionality. iGautier and Tsybakovl ( 1201 ll ) developed a Dantzig 
selector-type procedure to select important covariates and estimate the noise level simulta- 
neously in high- dimensional IV models where the dimensionality may be much larger than 
the sample size. Un der the assumption t hat the important covariates are uncorrelated with 
the regression error, iFan and Liaol (120121 ) proposed a penalized focused GMM method based 
on a nonsmooth loss function to perform variable selection and achieve oracle properties in 
high dimensions. All the aforementioned methods, however, do not exploit the sparsity of 
instruments and hence are still facing the dimensionality curse of many instruments. An- 
other active line of research in the econometrics literature has been concerned with the use 
of regularization and shrinkage methods for estim ating optima l inst ruments in the c ontext of 
estimating a low- dimensional param eter; see, e.g., 
ticular interest is the recent work of 



Belloni et al 



Okui 



20111) and lCarrascolfbom Of par 



( 120121 ). where Lasso-based methods were 



applied to form first-stage predictions and estimate optimal instruments in an IV model with 
many instruments but itself of fixed dimensionality. 

In this article, we focus on the application of high- dimensional sparse IV models to ge- 
netical genomics, where we are interested in associating gene expression data with a complex 
trait to identify potentially causal genes by using genetic variants as instruments. Motivated 
by this important application, we propose a two-stage regularization (2SR) methodology for 
identifying and estimating important covariate effects while selecting and estimating optimal 



instruments. Our approach merges the two independent lines of research mentioned above 
and provides a regularization framework for IV models that accommodate covariates and 
instruments both of high dimensionality. Specifically, the proposed procedure consists of 
two stages: In the first stage the covariates are regressed on the instruments in a regularized 
multivariate regression model and predictions are obtained, and in the second stage the re- 
sponse of interest is regressed on the first-stage predictions in a regularized regression model 
to perform final variable selection and estimation. In each stage, a sparsity-inducing penalty 
function is employed to yield desirable statistical properties and practical performance. The 
methodology can be viewed a high-dimensional extension of the 2SLS method, allowing the 
use of regularization methods to address the high-dimensional challenge in both stages. 

Several key features make the proposed methodology especially appealing for the kind of 
applications we consider in this article. First, unlike marginal regression models commonly 
used in empirical studies that analyze a few variables at a time, our method allows for the 
joint modeling and inference of high-dimensional genetical genomics data. In view of the 
fact that many genes interact with each other and contribute together to a complex trait 
or disease, joint modeling is crucial for correcting bias and controlling false positives due 
to possible confounding. Second, our method requires neither a specification of a small set 
of important instruments nor an importance ranking among the instruments; instead, we 
consider the estimation of optimal instruments as a variable selection problem and allow 
the procedure to choose important instruments based on the data. Third, the proposed 
implementation by coordinate descent optimization is computationally very efficient and 
has provable convergence properties, therefore bypassing the computational obstacles faced 
by traditional model selection methods. Finally, we rigorously justify our method for the 
representative case of L\ regularization in the high-dimensional setting where the dimensions 
of covariates and instruments are both allowed to grow exponentially with the sample size. 
Through the theoretical analysis, we explicate the impact of dimensionality and the role of 
regularization, and provide strong performance guarantees for the proposed method. 

The remainder of this article is organized as follows. Section 2 introduces the high- 
dimensional sparse IV model. The 2SR methodology and implementation are presented in 
Section 3. Theoretical properties of the regularized estimators are investigated in Section 4. 
We illustrate our method by simulation studies in Section 5 and an analysis of mouse obesity 
data in Section 6. We conclude with some discussion in Section 7. All proofs are relegated 
to the Appendix. 




Figure 1: A causal diagram showing the relationships between two genotypes z\ and Z2, two 
gene expression levels x\ and X2, a clinical phenotype y, and a confounding phenotype w. 

2 Sparse Instrumental Variables Models 

Suppose we have a quantitative trait or clinical phenotype y, a p-vector of gene expression 
levels x, and a q- vector of numerically coded genotypes z. In reality, there may be a sufficient 
set of unobserved confounding phenotypes w that act as proxies for the long-term effects 
of environmental exposures and/or the state of the microenvironment of the cells or tissues 
within which the biological processes occur. These phenotypes are likely to have strong 
influences on gene expression levels while contributing substantially to the clinical phenotype. 
Figure [T] illustrates the confounding between x and y with an example of six variables. If an 
ordinary regression analysis is to be applied, the effects of x\ and x 2 on y would be seriously 
confounded by w, resulting in an effect modification for x\ and a spurious association for X2- 
One way of controlling for the confounding due to w is through the use of the genotype 
z as inst ruments. In order for z to be valid instruments, the following conditions must be 



satisfied (jDidelez. Meng. and Sheehanll201Q[ ): 

1. The genotype z is (marginally) independent of the confounder w; 

2. The genotype z is not (marginally) independent of the intermediate phenotype x; 

3. Conditionally on x and w, the genotype z and the response y are independent. 

The above conditions are not easily testable from the observed data, but can be justi- 
fied on the basis of plausible biological assumptions. Condition 1 is ensured by the usual 
assumption that the genotype is assigned at meiosis randomly, given the parents' genes, and 
independently of any possible confounder. Condition 2 requires that the genetic variants be 
reliably associated with the gene expression levels, which is often demonstrated by cis-eQTLs 
with strong regulatory signals. Condition 3 requires that the genetic variants have no direct 
effects on the clinical phenotype and can affect the latter only indirectly through the gene 
expression phenotypes. Owing to the large pool of gene expressions included in genetical 
genomics studies, the possibility of a strong indirect effect is greatly reduced and hence this 



condition is also mild and tends to be satisfied in practice. 

Suppose we have n independent observations of (y, x, z). Denote by y, X, and Z, respec- 
tively, the nxl response vector, the nxp covariate matrix, and the nx q genotype matrix. 
Using the genotypes as instruments, we consider the following linear instrumental variables 
(IV) model for the joint modeling of the data (y, X, Z): 

y = X/3 + r/, 

x = zr + E, 

where (3 and T are a p x 1 vector and a qxp matrix, respectively, of regression coefficients, 
and rj = (rji, ... , r] n ) T and E = (ei, . . . , e n ) T are an n x 1 vector and an n x p matrix, 
respectively, of random errors such that the (p + 1)- vector (ef,T]i) is multivariate normal 
conditional on Z with mean zero and covariance matrix S = (o~jk)- We write ctjj = <r|. 
Without loss of generality, we assume that each variable is centered about zero so that no 
intercept terms appear in ([I]), and that each column of Z is standardized to have L2 norm 
\/n. We emphasize that e, and r\i may be correlated because of the arbitrary covariance 
structure. In contrast to the ordinary linear model regressing y on X, model (JJJ does not 
require that the covariate X and the error r\ be independent, thus substantially relaxing the 
assumptions of ordinary regression models and being more appealing in data analysis. 

We are interested in making inference for the IV model (pQ) in the high-dimensional 
setting where the dimensions p and q can both be much larger than the sample size n. 
In addition to selecting and estimating important covariate effects, since the identities of 
optimal instruments are unknown, we also regard the identification and estimation of optimal 
instruments as a variable selection and estimation problem. As is typical in high-dimensional 
sparse modeling, we assume that model ([1]) is sparse in the sense that only a small subset of 
the regression coefficients in /3 and T are nonzero. Our goal is, therefore, to identify and 
estimate the nonzero coefficients in both (3 and IV 

3 Regular izat ion Methods and Implementation 

In this section, we present the two-stage regularization methodology for the sparse IV model 
(JT]), propose an efficient coordinate descent algorithm for implementation, and discuss the 
selection of tuning parameters. 

3.1 Two-Stage Regularization 

It is well known that the ordinary least squares estimator is inconsistent in the presence 
of endogeneity, i.e., when some covariates are correlated with the regression error. One 



6 



standard way of eliminating the endogeneity is to replace the covariates by their expectations 
condition al on the instrum ents. This idea leads to the classical two-stage least squares (2SLS) 
method (jAndersonl 120051 ). in which the covariates are first regressed on the instruments 
and the response is then regressed on the first-stage predictions of the covariates. The 
performance of the 2SLS method deteriorates drastically or become inapplicable, however, 
as the dimensionality of covariates and instruments increase. We thus propose to apply 
regularization methods to cope with the high dimensionality in both stages of the 2SLS 
method, resulting in the following two-stage regularization (2SR) methodology. 

Stage 1. The goal of the first stage is to identify and estimate the nonzero effects of 
the instruments and obtain the predicted values of the covariates. We denote by || • \\p the 
Frobenius norm of a matrix. The first-stage regularized estimator is defined as 



argmm< — ||X 
rgMgxp I 2n 



q p 



(2) 



t=l j=l J 

where 7^ is the (i, j)th entry of the matrix T, p x (■) is a sparsity-inducing penalty function 
to be discussed later, and Xj > are tuning parameters that control the strength of the first- 
stage regularization. After the estimate T is obtained, the predicted value of X is formed by 

x = zf. 

Stage 2. Substituting the first-stage prediction X for X, we proceed to identify and 
estimate the nonzero effects of the covariates. The second-stage regularized estimator is 
defined as 

(3) 



3 = argminjily - X(3\\ 2 2 + f> M (|/3,|)l, 



where (3j is the jth component of f3, p M (-) is a sparsity-inducing penalty function as before, 
and fj, > is a tuning parameter that controls the strength of the second-stage regularization. 
We thus obtain the pair (/3,T) as our final estimator for the regression parameter (/3 ,r ) 
in model (CD). 

We consider t he following three choices of the penalty function p\(t) for t > 0: (a) the L\ 
penalty or Lasso ( Tibshiranil 199 6). p\(t) = At; (b) the smoothly clipped absolute deviation 



(SCAD) penalty f lFan and Lill2001 



p x (t) = A / ll(9<\) + 



u 



^^/(J>A)}<». 



a > 2; 



o 



and (c) the minimax concave penalty (MCP) (jZhangll2010l ) 

/" {a\ - 6) 
'0 



Px(t) 



dB, 



a > 1. 



a 



The SCAD and MCP penalties have an additional tuning parameter a to control the shape of 
the function. These penalty functions have been widely used in high- dimensional sparse mod 
elingand their properties are well understood in ordinary regression models; see 



Fan and Lv 



( 120101 ) for a review. Moreover, the fact that these penalties belong to the class of quadratic 
spline functions on [0, oo) allows for a closed-form solution to the corresponding penalized 
least squares pr oblem in each coordinate, leading to ver y efficient implementation via coor- 



dinate descent (IMazumder. Friedman, and Hastie 



20111 ). 



3.2 Implementation 

We now present an efficient coordinate descent algorithm for solving the optimization prob- 
lems (T5]) and ([3]) with the Lasso, SCAD, and MCP penalties. We first note that the matrix 
optimization problem (J2J) can be decomposed into p penalized least squares problems, 

7 = argmiJ — || Xj - Z7JI + YVa.G^jI) \, (4) 

where x^ is the jth column of the covariate matrix X and 7- = (71-,-, . . . ,'jqj) T - The uni- 
variate solution to the unpenalized least squares problem @ is given by 7^ = n~ 1 ('Xj — 
5^ fc ^j 7fcjZfc) T Zj = n _1 rjzj + 7^, where Zj is the jth column of the instrument matrix Z, 
Tj = Xj — Ysl=i Ikjik is the current residual, and we have used the fact n~ 1 zjzj = 1 due to 
standardization. The penalized univariate solution, then, can be obtained by 7^ = Si^if A), 
where S{-\ A) is a thresholding operator defined for Lasso, SCAD, and MCP, respectively, as 
<S , l b88 o(*; A) = sgn(t)(|i| - A)+, 



sgn(t)(|t|-A) + , if |£|<2A, 

. . \t\ — Xa/(a — 1) „ , 

Sgnt ' /V ; , if2A<t 

1 — l/(a — 1) 

t, if |t| > aA, 



and 



sgnft) ^ 1 A)+ , if It| < aA, 

Smcp(£;A)=<( l-l/a 

t, if \t\ > aA. 

Similarly, if the jth column X,- of the first-stage prediction matrix X is standardized to 
have L2 norm yjn, the penalized univariate solution for the optimization problem ([3]) is 
given by (3j = S(j3j] //), where (3j = n _1 r T Xj + (3j is the unpenalized univariate solution and 
r = y — YTk=i /^fcXfc is the current residual. We summarize the coordinate descent algorithm 
for computing the 2SR estimator (f3, T) in Algorithm [TJ 



Algorithm 1 Coordinate descent for the two-stage regularized estimator. 
Input: (3 ini , r ini , Ai, . . . , X p , p 
Initialize: (3 <— /3 ini , T 4r- r ini 
for j ' = 1, . . . ,p do 
repeat 

for % = 1 , . . . , q do 

until convergence of 7 ■ 

f <-(7i,...,7 P ),X<-Zf 
repeat 

for j = 1, . . . ,p do 
until convergence of (3 
return (3, T 



The convergence of Algorithm [I] to a local minimum for /3 and T is ensured by the 
convergence propert i es of coor dinate descent algorithms for penalized least squares; see 



Breheny and Huangi (120111 ) and 



Mazumder. Friedman, and Hastiel (120 111 ). Since the SCAD 



and MCP penalties are nonconvex, convergence to a global minimum is not guaranteed in 
general. In practice, coordinate descent algorithms are often used to produce a solution path 
over a grid of regularization parameter values, with warm starts from nearby solutions. In 
this case, the algorithm tends to find a sparse local solution with superior performance. 

3.3 Tuning parameter selection 

The 2SR method has p + 1 regularization parameters Ai, . . . , A p and p to be tuned. We 
propose to choose the optimal tuning parameters by A-fold cross-validation. Specifically, we 
define the cross-validation error for Xj and p, by 

CV(Ai) = iEl|x<* ) -ZW^(A i )||» (5) 

fc=i 

and 

cn»)=^i:\\y {k) -x {k ^ k) mi (6) 



*:=i 



respectively, where x,- , 7S k \ y^ k \ and X^-* are vectors/matrices for the feth part of the 



j 



sample, and T (Xj) and (3 (p) are the estimates obtained with the feth part removed. 
In view of the fact that in typical genetical genomics studies, both p and q can be in the 
thousands, it is necessary to reduce the search dimension of tuning parameters. To this end, 



we propose to first determine the optimal Xj that minimizes the criterion ([5]), for j = 1, . . . ,p, 
and then, with Ai, . . . , X p fixed, find the optimal /i that minimizes the criterion §§§. The 
practical performance of this search strategy proves to be very satisfactory. 

4 Theoretical Properties 

In this section, we investigate the theoretical properties of the 2SR estimators. Through our 
theoretical analysis, we wish to understand (a) the impact of the dimensionality of covariates 
and instruments as well as other factors on the quality of the regularized estimators, and 
(b) the role of the two-stage regularization in providing performance guarantees for the 
regularized estimators, especially for the second-stage estimators. To address (a), we adopt 
a nonasymptotic framework that allows the dimensionality of covariates and instruments to 
vary freely and thus can both be much larger than the sample size; to address (b), we impose 
conditions only on the instrument matrix Z, and treat the covariate matrix X and the first- 
stage prediction X as nondeterministic. The major challenge arises in the characterization 
of the second-stage estimation, where the "design matrix" X is neither fixed nor a random 
design sampled from a known distribution. Therefore, the existing formulations for the 
high-dimensional analysis of ordinary regression models are inapplicable to our setting. We 
also stress that our theoretical analysis is essentially different from the recent developments 



in s parse IV mode l s. Th e methods and results developed by iGautier and Tsybakovl (1201 ll ) 



and iFan and Liaol (120121 ) invol ye only one-stage est imation and regularization. The second- 
stage estimation considered by iBelloni et al.l ( 20121 ) is of fixed dimensionality, which allows 



them to focus on estimation efficiency based on standard asymptotic analysis. Owing to the 
complications involved in the analysis of a general penalty, we consider the representative 
case of L\ regularization in this section, which allows us to obtain much cleaner conditions 
that provide sufficiently useful insights. We also discuss the possibility of a generalization of 
our theoretical results to nonconvex penalties at the end of this section. 

We now introduce some notation. For any matrix A = (a^), denote by ||A||i and 
|| Alloc the matrix 1-norm and oo-norm, respectively, i.e., || A||i = maxj ^ \a,ij\ and UAHoo = 
maxj Yl j \ a ij\- F° r an y vector a., matrix A, and index sets I and J, let olj denote the 
subvector formed by the jth components of a with j e J, and A/j the submatrix formed 
with the (i,j)th entries of A with i 6 / and j G J . Also, denote by J c the compleme nt of 



J and | J| the number of elements in J. Following iBickel. Ritov. and Tsybakovl (120091 ). we 



10 



define the restricted eigenvalue for annxr matrix A and 1 < s < r by 

/ a n • • \\ A $h 

k A.s Emin mm _ ,, - ,, . 

\J\<» 6*0 y/n\\6j\\ 2 

US /o t " " 

l d J c l|iS'j||oj||l 

Define s\ = maxi<j< p |supp(7°)|, s 2 = |supp(/3 )|, and cr max = maxi<j< p <7.,-, where 7° is 
the jth column of the matrix IV We consider the parameter space with 1 1 I^o 1 1 1 < K and 
ll/^o 111 < M for some constants K, M > 0. 

To derive nonasymptotic bounds on the estimation and prediction loss of the regularized 
estimators T and (3, we need to impose the following conditions: 

(CI) There exists some K\ > such that k(Z, si) > K\. 
(C2) There exists some /^ > such that K(Zr , S2) > t 2 . 

We emphasize that the dimensions p and g, sparsity levels s\ and S2, and K\ and K2 niay all 
depend on the sample size n; we have suppressed th e notational dependence for concisen ess. 



Conditions (CI) and (C2) are analogous to those in iBickel. Ritov. and Tsybakovl ( 120091 ) for 
ordinary regression models, and require that the matrices Z and ZT be well behaved over 
some restricted sets of sparse vectors. One important difference, however, is that Condition 
(C2) is imposed on the conditional expectation matrix ZTo of X, rather than the first-stage 
prediction matrix, or the second-stage design matrix, X. This condition is more natural in 
our context, but poses new challenges for the analysis. 

The estimation and prediction quality of the first-stage estimator T is characterized by 
the following result. 

Theorem 1 (Estimation and prediction loss of T). Under Condition (CI), if we choose 



hgp + hgq 

J= J y n () 

with a constant C > 2 a/2, then with probability at least 1 — {pq) l ~ c ^ , the regularized esti- 
mator r defined by 02]) with the L\ penalty satisfies 



16C /logp + logg 

1 1 S 5-0"max s l'\/ 

k( V n 



and 



2 



* 16C 

Z(r-r )||^< ai^p Sl (logp + logq). 



Using the prediction bound provided by Theorem [TJ we can show that Condition (C2) 
also holds with high probability for the matrix X = ZT with a smaller k 2 ; see Lemma [A. II 
in the Appendix. This allows us to establish the following result concerning the estimation 
and prediction quality of the second-stage estimator j3. 

11 



Theorem 2 (Estimation and prediction loss of (3) . Under Conditions (CI) and (C2), if the 
regularization parameters Xj are chosen as in (J7J) and satisfy 



X mBX \2K + A max ) < 



k 1 k 2 



32 2 sis 2 
where A max = maxx<j< p Xj, then there exist constants Cq,Ci,C2 > such that, if we choose 



Co s.Qogp + logq) 

ki V n 

where C = CqK max(a p+ i, Ma maiX ), then with probability at least 1— Ci(pq)~ C2 , the regularized 
estimator (3 defined by (J3J) with the L\ penalty satisfies 



M/3 ah ^ 6AC ° /si(logp + logg) 
K1K2 V n 

and 

— — 64C 2 

||X09 - A,) III < ^is lS2 (logp + logg). 

K 1 K 2 

Of particular interest is the model selection consistency of (3. Define the covariance 
matrix C = n~ 1 (Zr ) T Zr , S = supp(/3 ), ip = || (Css) - 1 || 00 , and the minimum signal 
/3 min = rninjes |/3°|, where (3® is the jth component of /3 . We replace Condition (C2) by the 
following condition: 

(C3) There exists a constant < a < 1 such that ||Csc > 5(Cs5) _1 || 0O < 1 — a. 

Condition (C3) is in the same spirit as the irrepresentability condition in 



Zhao and Yu 



( 120061 ) for the ordinary Lasso problem. Although Condition (C3) is placed on the covariance 
matrix of ZFq, we can apply Theorem [1] to show that it also holds with high probability for 
the covariance matrix of X = Zr with a smaller a; see Lemma IA.3I in the Appendix. The 
model selection consistency of (3, along with an L^ estimation bound, is established by the 
following result. 

Theorem 3 (Model selection consistency of (3). Under Conditions (CI) and (C3), if the 
regularization parameter Xj are chosen as in ((7|) and satisfy 

^^A max (2ir + A max )^ < -j-^— , (10) 

then there exist constants cq,ci,C2 > such that, if the regularization parameter jjl is chosen 

as in (Q and satisfies 

2 

2 — a 
then with probability at least 1 — ci(pq)~ C2 , there exists a regularized estimator (3 defined by 
()3]) with the L\ penalty that satisfies 

12 



(a) (Sign consistency) sgn(/3) = sgn(/3 ), and 

(b) (Lao estimation loss) 

2C <p 



ll/9 S - /3° s 



Slice 



< 



si(logp + logg) 



(2 — ot)K\ V n 

The orem [3] shows that the second-stage estimator (3 has the weak oracle property in the 



sense of 



Lv and Fan 



(120091 ). Two remarks can be made from the analysis of the second-stage 
estimation. First, the validity of our arguments for Theorems |2] and [3] relies on the first-stage 
regularization only through the convergence rate of the prediction loss given by Theorem [TJ 
this allows the arguments to be adapted to other regularization methods for the first stage. 
Second, endogeneity entails that X and rj may be correlated, and we have to make use of 
the key assumption that E and rj are mean zero conditional on Z; see Lemma IA.2I in the 
Appendix. 

Theorems [TH3] deliver the important message that the dimensions p and q contribute only 
a logarithmic factor to the estimation and prediction loss, and thus are both allowed to grow 
exponentially with the sample size n. We note that (jHJ) and f flUj) are critical assumptions 
that relate the first-stage regularization parameter A max to the key quantities in the second 
stage. To gain more insight into the dimension relationships, suppose for simplicity that k±, 
K2, and <f are constants; then (|Sj) and (ITU]) hold for sufficiently large n provided that 

sjsKlogp + logq) = o(n). 

Thus, in this case, the dimensions p and q can grow at most as e°^ and the sparsity levels 
S\ and S2 as o(y/n) if the other quantities are fixed. These results have important practical 
implications and provide strong performance guarantees for the proposed 2SR method. 

Although we have focused on the L\ regularization method, our arguments can in prin- 
ciple be extended to a general class of nonconvex penalties including SCAD and MCP. To 
this end, C onditions (C1)-(C 3) sh ould be replaced by more general, weaker conditions such 



as those in 



Lv and Fan 



(120091 ) and IZhang and Zhang! (120121 ) . One can follow the same argu- 
ments as in Lemmas IA.1I and IA.3I in the Appendix to show that these conditions also hold 
with high probability for the first-sta ge prediction matrix X. O ne extra complication arises 
in showing that the condition (19) in IZhang and Zhang! (120121 ). which is not needed for the 
L\ penalty, also holds with high probability with X replaced by X. 

5 Simulation Studies 

In this section, we report on simulation studies to evaluate the performance of the proposed 
2SR method with the Lasso, SCAD, and MCP penalties. We compare the proposed method 
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with the ordinary penalized least squares (PLS) method with the same penalties that do not 
utilize the instruments, as well as the PLS and 2SR oracle estimators that knew the relevant 
covariates and instruments in advance. We are particularly interested in investigating how 
the PLS and 2SR methods perform differently in relation to the sample size and how the 
dimensionality and instrument strength affect the performance of the 2SR method. 

We first generated the sparse coefficient matrix T by sampling S\ — 5 nonzero entries of 
each column from the uniform distribution U([— b, —a] U [a, b}). To represent different levels of 
instrument strength, we took (a, b) = (0.75, 1) for strong instruments and (a, b) = (0.5, 0.75) 
for weak instruments. Similarly, we generated the sparse coefficient vector /3 by sampling 
S2 = 5 nonzero components from U([— 1, —0.5] U [0.5, f]). The covariance matrix S = ((Jij) 
was specified as follows: We first set <Tjj = (0.2)' J ~ J for i,j = l,...,p, and cr p+ i iP+ i = f; 
in addition to the five <Tj iP+ is corresponding to the nonzero components of /3 , we sampled 
another five entries from the last column of E; we then set these ten entries to 0.3 and let 
Vp+ij = cr j,p+i f° r J = 1> • • ■ iV- Note that the nonzero a.,- p+ is were intended to cause both 
effect modifications and spurious associations for the PLS method. Finally, the instrument 
matrix Z was generated by sampling each entry independently from Bernoulli (0.5), and the 
covariate matrix X and the response vector y were generated according to model ([I]). 

Since the PLS method provides no estimates for the coefficient matrix T and our main 
interest is in how the estimation of f3 can be improved by the 2SR method, we focus our 
comparisons on the second-stage estimation. Five measures on estimation, prediction, and 
model selection qualities were used to assess the performance of each method. The L\ 
estimation loss ||/3 — /3 ||i and the prediction loss n _1 / 2 ||X(/3 — /3 )|| 2 quantify the estimation 
and prediction performance, respectively. The model selection performance is characterized 
by the number of true positives (TP), the model size, and the Matthews correlation coefficient 
(MCC). Here, positives refer to nonzero estimates. The MCC is a measure on the correlation 
between the observed and predicted binary classifications and is defined as 
MCC TP x TN - FP x FN 



^(TP + FP)(TP + FN)(TN + FP)(TN + FN) 
where TN, FP, and FN denote the number of true negatives, false positives, and false neg- 
atives, respectively; a larger MCC indicates a better variable selection performance. In all 
simulations, we applied tenfold cross-validation to choose the optimal tuning parameters and 
averaged each performance measure over 50 replicates. 

We first consider the case where the dimensions p and q are moderately high and smaller 
than the sample size n. The following three models were examined: 
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Model 1 



{n,p, q) = (200, 100, 100) and (a, b) = (0.75, 1); 



Model 2: (n,p,q) = (400,200,200) and (a, b) = (0.75,1); 
Model 3: (n,p,q) = (400,200,200) and (a, b) = (0.5,0.75). 

The simulation results for Models 1-3 are summarized in Table HJ From the table we see 
that the 2SR method improved on the performance of the PLS method substantially in all 
cases. The improvement on model selection performance was most remarkable. The PLS 
method selected an exceedingly large model with many false positives because of its failure in 
distinguishing between the true and confounding effects, whereas the 2SR method resulted 
in a much sparser model and controlled the number of false positives at a reasonable level. 
As a result, the 2SR method had a much higher MCC than the PLS method, indicating a 
superior variable selection performance. The estimation and prediction performance of the 
PLS method was also greatly compromised by the confounding effects, and the 2SR method 
achieved a dramatic improvement on the L\ estimation loss and a considerable improvement 
on the prediction loss. With smaller sizes of the nonzero coefficients in To, Model 3 reflects a 
weaker instrument setting. The comparisons between the results for Model 2 and 3 suggest 
that a weaker instrument strength tends to decrease the performance of the 2SR method, 
as expected, especially on the estimation and prediction quality. We observe, however, that 
the model selection quality was only slightly affected and the overall performance of the 2SR 
method was still very satisfactory. 

To facilitate performance comparisons among different methods with varying sample size, 
Figure |2] depicts the trends in three performance measures with the dimensions p = q = 100 
fixed and the sample size n varying from 200 to 1500. It is clear from Figure [2] that the per- 
formance of the 2SR method improves consistently as the sample size increases, whereas the 
PLS method does not in general see performance gain and may even deteriorate. Moreover, 
a closer look at the tails of the curves for the 2SR method with different penalties reveals 
certain advantages of SCAD and MCP over the Lasso. There seems to be a nonvanishing 
gap between the Lasso and oracle estimators, which agrees with the existing th eory in th e 



context of linear regression that the Lasso does not possess the oracle property ( IZoul 120061 ) . 
We further study the case where the dimensions p and q are ultrahigh and larger than 
the sample size n. The following three models were considered: 

Model 4: {n,p,q) = (300,600,600) and (a, b) = (0.75,1); 
Model 5: (n,p, q) = (500, 1000, 1000) and (a, b) = (0.75, 1); 
Model 6: (n,p, q) = (500, 1000, 1000) and (a, b) = (0.5, 0.75). 

Table [2] summarizes the simulation results for Models 4-6, and Figure [3] shows the perfor- 
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mance curves with the dimensions p = q = 600 fixed and the sample size n varying from 200 
to 1500. Trends in performance comparisons among different methods are similar to those 
in Table [1] and Figure [2l demonstrating the advantages of the 2SR method over the PLS 
method. We observe that, although the ultrahigh dimensionality caused the 2SR method 
to select a larger model and resulted in a slightly lower MCC than in the previous settings, 
the performance of the 2SR method still compared favorably to the PLS method and the 
difference was pronounced for moderate sample sizes. These results suggest that the dimen- 
sionality has only mild impact on the performance of the 2SR method compared with the 
sample size, in agreement with our theoretical results in Section 4. 



6 Analysis of Mouse Obesity Data 

To illustrate the application of the proposed metho d, in this sect i on we present results from 
the analysis of a mouse obesity data set described by lWang et al.l (120061 ). The study includes 
an F2 intercross of 334 mice derived from the inbred strains C57BL/6J and C3H/HeJ on an 
apolipoprotein E (ApoE) null background, which were fed a high-fat Western diet from 8 to 
24 weeks of age. The mice were genotyped using 1327 SNPs at an average density of 1.5 cM 
across the whole genome, and the gene expressions of the liver tissues of these mice were pro- 
filed on microarrays that include probes for 23,388 genes. Data on several obesity-related clin- 
ical traits were also collected on the animals. The genotype, gene expression, and clinical data 



are available for download, respectively, at http://www.genetics.Org/cgi/content/full/genetics.110.116087/I 



ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/SeriesMatrix/GSE2814/, and http://labs.genetics.ucla.edu/ho 

After the individuals, SNPs, and genes with a missing rate greater than 0.1 were removed, 
the remai ning missing genotype and ge ne expression data were impute d using the Beagle 



appro ach (JBrowning and Browningll2007l ) and nearest neighbor averaging (JTroyanskaya et al. 



20011 ). respectively. Merging the genotype, gene expression, and clinical data yielded a 
complete data set with q = 1250 SNPs and 23,184 genes on n = 287 (144 female and 143 
male) mice. To enhance the interpretability and stability of the results, we focus on the 
y = 2 825 genes that can be mapped to the Mouse Genome Database (MGD) (jEppig et al. 



20121 ) and have standard deviation of gene expression levels greater than 0.1. The latter 
criterion is reasonable because gene expressions of too small variation are typically not of 
biological interest and suggest that the genetic perturbations may not be sufficiently strong 
for the genetic variants to be used as instruments. 

Our goal is to jointly analyze the genotype, gene expression, and clinical data to identify 
important genes related to body weight. In order to utilize data from both sexes, we first 
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adjusted the body weight for sex by fitting a marginal linear regression model with sex as the 
covariate and subtracting the estimated sex effect from the body weight. We then applied 
the proposed 2SR method with the Lasso, SCAD, and MCP penalties to the data set with 
adjusted body weight as the response. For comparison, we also applied the PLS method to 
the same data set, and used tenfold cross-validation to choose the optimal tuning parameters 
for both methods. The models selected by cross-validation include 110 (Lasso), 49 (SCAD), 
and 16 (MCP) genes for the PLS method, and include 37 (Lasso), 15 (SCAD), and 9 (MCP) 
genes for the 2SR method. It is as expected that the 2SR method produced much sparser 
models than the PLS method. 

To gai n insight into the stability of the se lection results, we followed the idea of stability 
selection (IMeinshausen and Buhlmannl |2010| ) to compute the selection probability of each 
gene over 100 subsamples of size \n/2\ for each fixed value of the regularization parameter \x. 
The resulting stability paths for different methods are displayed in Figure HI It is interesting 
to observe that, among the genes with maximum selection probability at least 0.4, only 5 
(Lasso), 3 (SCAD), and (MCP) genes are common to both the PLS and 2SR methods. As 
can be seen from Figure HI these few genes, which are reasonably conjectured to be among 
the truly relevant ones, stand out more clearly in the stability paths of the 2SR method. 
Moreover, the overall stability paths of the 2SR method seem less noisy and hence can be 
more useful for distinguishing the most important genes from the irrelevant ones. 

Table |3] lists the genes that were chosen by stability selection with maximum selection 
probability at least 0.5 using the 2SR method with three different penalties. Among these 
17 genes, only three were also selected by the PLS method. This includes insulin-like growth 
factor bind ing protein 2 (Igfbp2), w hich has been shown to protect against the development 



of obesity (j Wheat croft et al. 



2007J). Among the genes identified only by the 2SR method, 



apolipoprotein A-IV (Apoa4) plays an important role in lipoprotein metaboli sm and has 
been implicated in the control of food intake in rodents ( Tso. Sun, and Liull2004j ); it inhibits 
gastric emptying and serves as a satiety factor in response to ingestion of dietary fat. Apoa4 
also acts as an entero gastrone, a humoral inhibitor of gastric acid secretion and motility 
( Qkumura et al.lll994T) . and is regulated by leptin, a major component of energy homeostasis 
( IDoi et al.ll200l[ ). These previous findings suggest a potential role of Apoa4 in the regulation 
of food intake and, consequently, body weight. Suppressor of cytokine signaling 2 (Socs2) 
is a negat ive regulator in the growth hormone/insulin-like growth factor (IGF)-I signaling 
pathway ( jMetcalf et al.ll2000l ). which is directly related to obesity. Slc22a3 is a downstream 
gene of the IGF signaling pathway. Recent studies have showed that the Gpldl gene is asso- 
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dated with serum glycosylphosphatidylinositol-specific phospholipase D (GPI-PLD) levels, 
which predict cha nges in insulin sensitivity in response to a low-fat diet in obese women 



(JGray et al.ll2008h. The IGF-bindin g protein also induces laminin gamma 1 (Lamcl) tran- 



scription ( lAbrass and Hansenll2010l ). These identified genes clearly point out the importance 
of the IGF signaling pathway in the development of obesity in mice. 

Table [3] also presents the czs-SNPs, which are defined to be the SNPs within a 10 cM 
distance of each gene, that are associated with the selected genes. These ds-SNPs are likely 
to play a critical role in the regulation of the target genes and serve as strong instruments in 
statistical analysis. Note that not all selected genes have cis-SNPs identified, partly because 
of the nonuniform distribution of genotyped SNPs in this data set. If the criterion is relaxed 
to within 25 cM of each gene, we find that 13 out of the 17 genes in the table have at least 
one czs-SNP identified. 

7 Discussion 

We have proposed a 2SR method for variable selection and estimation in sparse IV models 
where the dimensions of covariates and instruments can both be much larger than the sample 
size. We have developed a high-dimensional theory that supports the theoretical advantages 
of our method and sheds light on the impact of dimensionality in the resulting procedure. We 
have applied our method to genetical genomics studies for jointly analyzing gene expression 
data and genetic variants to explore their associations with complex traits. The proposed 
method provides a powerful approach to effectively integrating and utilizing genotype, gene 
expression, and clinical data, which is of great importance for large-scale genomic inference. 
We have demonstrated on simulated and real data that our method is less affected by con- 
founding and can lead to more reliable and biologically more interpretable results. Although 
we are primarily motivated by genetical genomics applications, the methodology is in fact 
very general and likely to find a wide range of applications in epidemiology, econometrics, 
and many other fields. 

In our analysis of genetical genomics data, only genetic variants are used as instruments, 
and gene expressions that fail to be associated with any genetic variants in the first stage of 
the 2SR method have to be excluded at the second stage, which may comprise the inferences 
for genes with weak genetic effects. Epigenetic processes, such as DNA methylation, histone 
modification, and various RNA-mediated processes, are also known to play an essential role 
in the regula tion of gene expression, and their influences on the gene expression levels may 



be profound (IJaenisch and Birdll2003l ). Thus, when epigenetic data are also collected on the 



same subjects, they can be similarly treated as potential instruments in the sparse IV model. 
The joint consideration of genetic and epigenetic variants as instruments is likely to yield 
stronger instruments than using the genetic variants alone, which may lead to more reliable 
genomic inference. 

Several extensions and improvements of the methodology are worthwhile to pursue. We 
have applied regularization methods to exploit the sparsity of individual coefficients, which 
allows the first-stage estimation to be decomposed into p penalized least squares problems 
and enjoys great theoretical and computational simplicity. The first-stage prediction, how- 
ever, could be substantially improved by borrowing information across the p regression prob- 
lems. One possibility would be to expl oit the structural sparsity of the coefficient matrix 



throu gh certain matrix decompositions (jChen. Chan, and Stensethl |2012| ; IChen and Huang 



20121 ). Moreover, since the 2SR method is a high-dimensional extension of the classical 2SLS 
method, it would be natural to ask whether other IV estimators such as the LIML and 
GMM estimators can also be extended to our high-dimensional setting. Although asymp- 
totic efficiency may not be a primary concern in high dimensions, certain advantages of these 
estimators in low dimensions could still carry over and lead to performance improvement. 

Appendix: Proofs 
A.l Proof of Theorem [TJ 

Since the optimization problem (J2J) can be decomposed into p penalized least squares prob- 
l ems, the result is a straightforward extension of Theorem 7.2 of 



Bickel. Ritov. and Tsybakov 



( 120091 ) to the multivariate regression case. From Condition (CI) and the aforementioned re- 
sult it follows that, with probability at least 1 — gexp(— nA|/(8cr|)), the regularized estimator 
•jj defined by (@J with the L x penalty satisfies 

||7i-75lli<16«iA i /«? (A.l) 

and 

||Z(7 J -7?)ll2<16^iA J 2 /4 (A.2) 

Using the union bound, we have, with probability at least 1 — Y7j=i <?exp(— nA|/(8erJ)), the 
regularized estimator Y = (7 1; . . . , 7 ) satisfies ||r — T 1| 1 < 16siA max /K 2 and ||Z(r— r )||^ < 
16npsiA^ iax /«i. Now, if we choose Xj = Caj^JlJogp + log q)/n with a constant C > 2\/2, 
then with probability at least 1 — (pq) l ~ c //8 , the desired inequalities hold. 
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A. 2 Proof of Theorem d 

The proof of Theorem [2] relies on two key lemmas. Lemma [A. II shows that Condition (C2), 
which is imposed on the matrix ZTo, also holds with high probability for the matrix X = ZT. 
Lemma IA.2I establishes a fundamental inequality that will be essential to the proof. 

Lemma A.l. Under Conditions (CI) and (C2), if the regularization parameters Xj are 
chosen to satisfy 



2,^2 



A m ax(2A + A max ) < 



rii f\j 



l"-2 



32 2 sis 2 ' 

then with probability at least 1 — ^? =1 gexp(— nA 2 /(8cr|)), the matrix X = ZT, where T is 
defined by ([2]) with the L\ penalty, satisfies 

«(X, S2 )>|. 

Proof. For any subset J C {l,...,p} with \J\ < s 2 and any 8 G M. p with d ^ and 
||<5j c ||i < 3||<5j||i, we have 

5 T (X T 5t - (ZT ) T ZT )6 < WSWlm^^^lSlJ^-iZ^fZ^l 



II c 112 — - II r 112 

n \\°j\\2 n \\°j\\2 

Since ||<5 JC ||i < 3||*j||i, we have \\d\\l = (||<5j||i + ||<5,/ c ||i) 2 < 16||tfj||? < 16s 2 ||£j||1- To 
bound the entrywise maximum, we write 

xfx, - (Z 7 °f Z 7 ° = (% - Z 7 °f(x, - Z 7 °) + ft - Z 7 °f Z 7 ° + {Zrfifft - Z 7 °) 



" 1 3' 

, , . , , „ „, ..., /Z( 7 .- , 



{% - 7 °f Z r Z( 7 , - tJ) + {% - 7 °f Z r Z 7 ° + (Z 7 °f Z( 7 , - 7 ° 



= Tl + T 2 + T 3 . 

We now condition on the event that (1A.1[) and (1A.2J) in the proof of Theorem 1 hold for 
j = 1, . . . , p, which occurs with probability at least 1 — £)f=i gexp(— nA|/(8crJ)). Then, by 
the Cauchy-Schwarz inequality, 

|Ti|<l|Z(7i-7?)N|Z(7 i -'75)|| 2 <16nai^ 1JIX /«?. 
Also, noting that ||zj|| 2 = y/n by standardization and || r*o || i < K> we have 

\T 2 \ < || 7i - 7 -||i max |z^||| 7 °||i 

l<k,l<q J 

||zjfc||2||zi||2||7?lli < 16AnsiA max /r 2 

Similarly, |T 3 | < 16KnsiX max / k\. Combining these bounds and using the assumption, we 
obtain 



<||7i-7°Hi max W|2||zj||2||7?IIi < 16iCnsiA max /«i. 

Kk,l<q J 



S T (X T X-(ZT ) T ZT )S ltfs lS2 ^ w l^£i£a _4±_ _ ( ^ 2 

n c- n9 — 9 -•^maxl-^-'^ T -Amaxi Ji o ' O r>o 

n.||dj|| 2 k( k( 32^sis 2 
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This, together with Condition (C2), proves the lemma. 

Lemma A. 2. Under Conditions (CI) and (C2), if n^ 2 Si{\ogp + log q) = 0{n) and the 

regularization parameters Xj are chosen as in (J7j) ; then there exist constants Cq,Ci,C2 > 

such that, if we choose 

C /si(logp + logg) 



Ki V n 

where Co = cqK max.(a p+ i, Mcr max ) ; then with probability at least 1— c\(pq)~ C2 , the regularized 
estimator j3 defined by ([3]) with the L\ penalty satisfies 



i||x(3-A,)ll! ' ^ 



2 ' §||/3-/3olli<2^||/3 s -y9°5l|. 



Proof. By the optimality of f3, we have 
1 

2n 
Substituting y = X/3 + 77, we write 



— \\y-xp\\i + M\i<7^\\y-xfr 112 



1 

2n' 



11 2 



/x||/3. 



oil 1- 



|y-X^||l = ||r7-(X/3-X/3 ^ 112 



'0JW2 



\n\\l + ||XJ3 - X/3 ||l - 2rf(x3 - X/3 ) 



l»7l 



|X(/3 - /3 ) + (X - X)/3 ||i - 2TJ 1 (X/3 - X/3 



Nl* + ||X(3 - /3 )||* + || (X - X)^ ||l - 2rf(x3 - X/3 ) 
+ 2/3 r (X-XfX(3-/3 ) 



and 



X/3, 



II 2 



|77-(X-X)# 



II 2 



|r7||^ + ||(X - X)/3 y - 2r7 J (X - X)/3, 



Combining the last three displays yields 



2n 



\x(P-0 o )\\l<ii\\Poh-i*\\P\b 

</i||/3 ||i-/i||3lh 



Vx(3 - /3 ) - i/3f (X - X) T X(3 - /3 

n n 



01 



-X T r7 - ix T (X - X)/3 C 
n n 



11/3 -# 



II 1- 



(A.3) 



Next, we condition on the event that (IA.10 and (1A.2D in the proof of Theorem 1 hold for 



J 



,p, which occurs with probability at least 1 — (pq) ' , and find a probability 



bound for the event that 



-X T rj - -X T (X - X)/3 



r? 



n 



<£. 



(A.4) 



Substituting X = ZT and X = ZT + E, we write 

-X T r7--X T (X-X)/3 
n n 
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= -f T Z T ri - -f T Z T (zf - Zr - E)/3 
n n 

= -{f - T ) T Z T r 1 + -r^Z T 77 + i(f - r ) T z T E/3 + -r^Z T E/3 
n ran n 

- i(f - r ) T z T z(f - r )/3 - ^r£z T z(f - r )/3 

= Tx + T 2 + T 3 + T 4 + T 5 + T 6 . 
b bound the term 7\, it follows from (1A.1I) . the union bound, and a Gaussian tail bound 



(JMassart 



20071 . page 19) that 



P(l|Ti||„o>£)<P 



-Z T r7 

n 



> 



/' 



16siA max 12 



— < gexp 



n 



Noting that ||r ||x < K, we have 



/' 



P T,L > ^ <P 



12 



-Z T r/ 

n 



> 



[J 



12K 



< 



gexp 



/? 



Kt 



2cr 2 x Vl6sxA r 



ii 
12 



2^ + i V 12JJT 



/' 



To bound the term T 3 , using (lA.lj) and ||/3 ||i < ^P? we obtain 



PI ll^slloo > — ] < P\ max 

11 - 12/ _ \i<i<<?,i<i<P 



1 T 
n 



> 



/-.'T 



// 



< 



pqexp 



n 



16 Sl A max 12M 

9 \ 2 

«1 /f_\ 



2aL x V16siA max 12M; 



where £,• is the jth column of the matrix E. Similarly, 



PI ll:/l||oo>^) <P\ max 

12 / 1 l<i<q,l<j<P 



n 



T 



n 



// 



^W<m}^^\-2^\^KM 



To bound the term T 5 , it follows from (1A.2J) . ||/3olli < M, and the Cauchy-Schwarz inequality 
that 

1 

n 



| T5 1 1 oo < M max 

l<i,j<p 



3(7,-7°) T Z r Z(7,-7;; 



1 Ifis A 2 

<Mmax -llza-^lhllZ^-^lb^M^i^-. 



i<m<p n 
Noting that ||zj|| 2 = \fn by standardization, we have 



I T 6 1 1 oo < KM max 

1<«<5, l<j<p 



-$ Z( 7 , - 7 ° 



77 



j 'j' 



< KM 



max _ ||Z(7--75)|| 2 
i<j<p v n 



< KM 



4y^7A r 

«1 



Combining the above bounds and using (JTj) and the assumption k x 2 sx(logp + logg) = 0(n), 
it is easy to see that, there exist constants c ,Ci,C2 > 0, if we choose 

c K 



/' 



«i 



■max(cr p+ x,M(T n 



Si(logp + logg) 



?? 
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then (IA.4|) holds with probability at least 1 — c\{pq) C2 . This, together with (IA.3I) . implies 

^||X(3 - (3 )\\j < dlflja - p0\\ x + |||3 - /3 ||i. 
Adding /x||/3 — /3 ||i/2 to both sides yields 

^||X(3-/3 )||l + |||3-/3 ||i 

< MIIA,lli - Plli + - Mi) = K\\P°sh - 8 \\i + ||fe - ^110 < 2/i||3s - Mi- 
This completes the proof of the lemma. 

Proof of Theorem^ We first note that (J7|) and (jHl) entail that the condition n^ 2 si(\ogp + 
logq) = 0(n) in Lemma [A. 2 1 is satisfied. It then follows from Lemma [A.2I that, with proba- 
bility at least 1 — Ci(pq)~ C2 , we have 

i-||X(3 - /9 )||1 < 2^0 S - /3° || x < 2/V^Ps - /3° || 2 (A.5) 

and 

|||3 - /3 |U < 2/i||3 5 - P° s \\i, or ||/fe - /&||i < 3||3 S - f3°sh- 

The last inequality, the definition of fc(X, S2), and Lemma 1 together imply 

~ ||X(g-/3 )|| 2 2||X(g-/3 )|| 2 

p s — p g 2 < < F= • ( A -0) 



Combining (1A.5J) and ( 1A.6J) . we obtain 



||X(/3-/3 )|| 2 2 <^ns 2/ i 2 

and 

||/3 - /3 |U < 4||/3 S - f3% < A^WPs ~ 0sh < ~ s ^- 

K 2 

Substituting (J5J) for // concludes the proof. 

A. 3 Proof of Theorem [3] 

Central to the proof of Theorem 3 is the following lemma, which shows that Condition (C3) 
also holds with high probability for the matrix X = ZT and gives a useful bound for the 

inverse matrix norm IKCs^) -1 ^. 

Lemma A. 3. Under Condition (CS), if the regularization parameters Xj are chosen to satisfy 

!6siS2 foTr , x \ ^ a 

nf A — a 

then with probability at least 1 — Y^j=i <?exp(— nX 2 /(8a 2 )), the matrix C = n. _1 X X = 
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n 1 (Zr) T Zr, where T is defined by (J2J) with the L\ 'penalty, satisfies 

^ A ni 

\\{Vss)^U< Wo -^ (A.7) 

2(2 — a) 

and 

||C 5C 5(C55)^||oo<l--. (A.8) 

Proof. We condition on the event that ( lA.lj) and (1A.2[) in the proof of Theorem 1 hold for 
j = 1, . . . ,p, which occurs with probability at least 1 — X^Li ?exp(— nA|/(8cr|)). From the 
proof of Lemma 1, we have 

max -|xfx, - (Z 7 °) T Z 7 °| < ^\ max (2K + A max ). (A.9) 



It follows that 



-^ 16si ot 

\Css - Cs^llooV? < s 2 — ^X max (2K + A max )y? < 



4- a 



by assumption. Then, using an error bound for matrix inversion (IHorn and Johnson 
page 336), we obtain 

life)- - (c 5S )-|U < -%^#f v < ^—^- 

l-||C 5 5-C 5 5||oo^ 2(2 -a) 
Thus, by the triangle inequality, we have 

^ ^ o/ 4 — o/ 

IKc^-loo < \\{Cs 8 )- x U + ll(Css)- 1 - (Cssr'Woo <V + ^T^ < ^3^f > 

which proves (1A.7J) . 

To show the inequality ( 1A.8J) . we write 



1985 



Cs c s(Css) —Cs c s(Css) — (Cs^s — Cs c s)(Css) —Cs^siCss) (Css — Css)(Css) ■ 
It follows from (lATfjl . ([Oil , and Condition (C3) that 

110505(055) — Cscs(Css)~ ||oo 

< II Cs c S — Cs^slloolKCss) ||oo + ||C5 C 5(C55) HooHCss' — CsslloolKCss) ||oo 

I651S2 , nT r 4 — a 16sis 2 , 4 — a 

< 2 ^max(2ii + A max J— r<£> + (1 - «) ^ A max (2A + A max J — r<£ 

ftji Z(Z Ot) K-t Z(Z Ot) 

A — a a ,„ .4 — a a a 

< — + (1 - a)— = -. 

_ 2(2 -a) 4- a v ; 2(2-a) 4- a 2 

This, along with Condition (C3), proves (1A.8J) . This completes the proof of the lemma. 
Proof of Theorem 3. We first note that the zero subgradient condition for (3 G MP to be a 



24 



solution to the optimization problem ([3]) with the L\ penalty is given by 



1 



n 

1 x T 

n a 



X|(y-X/3)=/isgn(/3 



sv> 



X/3) 



< 



/i. 



(A.10) 
(A.ll) 



where S = supp(/3). It suffices to find a (3 with the desired properties such that (lA.lOp and 
( lA.lip hold. Let (3 S c = 0. The idea of the proof is to first determine (3 S from (1A.10D . and 
then show that thus obtained (3 also satisfies flA.llj) . 

Using the same arguments as those in the proof of Lemma [A. 2\ we can show that, there 
exist constants c ,Ci,c 2 > such that, if we can choose /x as before, then with probability 
at least 1 — ci(pq)~ C2 , it holds that 

a 



ix^ - Ix^(X - X)/3 
n n 



< 



a: 



-//. 



(A.12) 



From now on, we condition on the event that ( 1A.12I) holds and analyze conditions ( 1A.10I) 
and ( fATTjl . 

We first determine f3 s from (1A.10J) . By substituting 



y - X3 = X s /3° + r) - 5tsPs = rj - (X s - X s )/3° - X S S ~ P 



1) 



(A.13) 



we write (1A.10I) with S replaced by S in the form 



^X|t7 - ^Xf(X s - X s )f3° s - C ss 0s - P°s) = /isgn(3 5 ), 



or 



Ps-P 



fC 



,, , - ' \ Ix|i7 - -X T S {X S - X s )/3° - /i sgn(3 5 ) 
n n 



(A.14) 



This, along with ( 1A.7J) and (1A.12J) . leads to 



|fe-^||oo<||(C 



1 — 1 1 



SSj 



A- a 



n 



-Xjtj 



/? 



-X Q (Xs 



X 5 )/3° 5 



/' 



o 



4- a 



/x + /i 



2-a 



/zy? < /3 n 



by assumption, which entails that sgn(/3 5 ) = sgn(/3°). Since /3 5c = f3° Sc = by definition, 
we have S = S. Let j3 s be defined by ( 1A.14I) with sgn(/3 5 ) replaced by sgn(/3° s ). Clearly, 
thus defined (3 satisfies the desired properties and condition (lA.lOj) . 

It remains to show that (3 also satisfies condition (IA.11I) . By substituting (IA.13I) and 
( 1A.14I) . we write 

ix£ c (y - X3) = ^X| C 77 - ^%(X S - X s )(3° s 
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- CscsiCssy'Uxlrj - ^X T S (X S - X s )(3° s - //sgn(3 s )|. 
It follows from (|Q|) and f[Q2|) that 



~X£,(y-XJ9) 



< 



^Xlv-^XUXs-Xs)^ 



< 



a I . n 



OG 

-Xf^ - ix|(X s - X s )/3° 



4-a^l 1 2 



1 

r? 
a 



n 



+ /i 



4- a 



/^ + /^ 



/1. 



Since S" = S", we see that (3 also satisfies (1A.11J) . which finishes the proof. 
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Table 1: Simulation results for Models 1-3. Each performance measure was averaged over 50 replicates with standard deviation 
shown in parentheses 



PLS 



2SR 



Method L\ est. loss Pred. loss TP Model size 



MCC 



Li est. loss Pred. loss TP Model size MCC 











Model 1: (n,p, q) 


Lasso 


2.49 


(0.57) 


0.79 (0.16) 


5.0 (0.0) 46.9 (11.2) 


SCAD 


2.12 


(0.53) 


0.82 (0.17) 


5.0 (0.0) 29.6 (6.0) 


MCP 


2.12 


(0.59) 


0.82 (0.17) 


5.0 (0.0) 24.3 (6.6) 


Oracle 

GO 


0.75 


(0.11) 


0.58 (0.12) 


5 (0) 5 (0) 


O 








Model 2: (n,p, q) 


Lasso 


2.71 


(0.44) 


0.81 (0.14) 


5.0 (0.0) 74.5 (14.3) 


SCAD 


2.32 


(0.36) 


0.84 (0.14) 


5.0 (0.0) 47.2 (10.2) 


MCP 


2.28 


(0.45) 


0.84 (0.15) 


5.0 (0.0) 36.2 (11.5) 


Oracle 


0.76 


(0.07) 


0.58 (0.11) 


5 (0) 5 (0) 
Model 3: (n,p, q) - 


Lasso 


3.04 


(0.39) 


0.86 (0.11) 


5.0 (0.0) 72.3 (12.5) 


SCAD 


2.64 


(0.36) 


0.89 (0.11) 


5.0 (0.0) 43.3 (11.9) 


MCP 


2.61 


(0.42) 


0.89 (0.12) 


5.0 (0.0) 33.5 (11.9) 


Oracle 


1.00 


(0.08) 


0.63 (0.09) 


5 (0) 5 (0) 



= (200,100,100), (a, b) = (0.75,1) 

0.25 (0.06) 1.47 (0.69) 0.78 (0.26) 

0.36 (0.06) 1.21 (0.55) 0.74 (0.32) 

0.42 (0.08) 1.26 (0.66) 0.82 (0.34) 

1 (0) 0.51 (0.20) 0.47 (0.22) 

= (400,200,200), (a, b) = (0.75,1) 

0.21 (0.03) 1.16 (0.52) 0.57 (0.18) 

0.29 (0.05) 0.86 (0.43) 0.49 (0.17) 

0.36 (0.08) 0.76 (0.39) 0.51 (0.19) 

1 (0) 0.41 (0.16) 0.37 (0.15) 

(400,200,200), (a, b) = (0.5,0.75) 
0.22 (0.03) 1.72 (0.73) 0.73 (0.24) 
0.32 (0.06) 1.50 (0.65) 0.69 (0.25) 
0.38 (0.09) 1.36 (0.67) 0.71 (0.26) 
1 (0) 0.57 (0.23) 0.43 (0.17) 



5.0 (0.2) 


14.5 (5.6) 


0.58 (0.12) 


5.0 (0.2) 


12.9 (4.3) 


0.62 (0.12) 


4.9 (0.2) 


9.5 (3.8) 


0.74 (0.16) 


5(0) 


5(0) 


1(0) 


5.0 (0.0) 


18.1 (7.0) 


0.54 (0.11) 


5.0 (0.0) 


14.0 (5.5) 


0.62 (0.13) 


5.0 (0.0) 


9.3 (3.3) 


0.76 (0.14) 


5(0) 


5(0) 


1(0) 


5.0 (0.0) 


18.3 (6.3) 


0.53 (0.10) 


5.0 (0.1) 


16.8 (6.7) 


0.56 (0.12) 


5.0 (0.2) 


11.0 (4.4) 


0.69 (0.14) 


5(0) 


5(0) 


1(0) 



PLS 



2SR 
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SCAD 

MCP 

Oracle 
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Figure 2: Performance curves for different methods with the dimensions p 
and the sample size n varying from 200 to 1500. 



q = 100 fixed 
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Table 2: Simulation results for Models 4-6. Each performance measure was averaged over 50 replicates with standard deviation 
shown in parentheses 









PLS 






2SR 






Method 


Li est. loss 


Pred. loss 


TP Model size 


MCC L x est. loss 


Pred. loss 


TP 


Model size 


MCC 








Model 4: (n,p, q) ■ 


= (300,600,600), (a, b) = 


(0.75, 1) 








Lasso 


2.22 (0.44) 


0.79 (0.15) 


5.0 (0.0) 69.1 (20.0) 


0.26 (0.04) 1.67 (0.81) 


0.78 (0.29) 


5.0 (0.0) 


25.5 (10.5) 


0.46 (0.10) 


SCAD 


2.03 (0.38) 


0.81 (0.17) 


5.0 (0.0) 48.6 (14.1) 


0.32 (0.05) 1.52 (0.74) 


0.70 (0.25) 


5.0 (0.0) 


26.3 (12.2) 


0.47 (0.12) 


MCP 


1.81 (0.34) 


0.79 (0.17) 


5.0 (0.0) 28.7 (9.3) 


0.43 (0.07) 1.26 (0.68) 


0.74 (0.30) 


5.0 (0.1) 


14.7 (7.1) 


0.62 (0.14) 


Oracle 

GO 


0.78 (0.08) 


0.57 (0.10) 


5 (0) 5 (0) 


1 (0) 0.42 (0.15) 


0.38 (0.15) 


5(0) 


5(0) 


1(0) 


to 






Model 5: (n,p, q) = 


= (500,1000,1000), (a, b) -- 


= (0.75, 1) 








Lasso 


2.21 (0.46) 


0.80 (0.19) 


5.0 (0.0) 87.1 (29.3) 


0.24 (0.04) 1.28 (0.78) 


0.56 (0.23) 


5.0 (0.0) 


26.8 (13.3) 


0.46 (0.11) 


SCAD 


2.05 (0.36) 


0.83 (0.20) 


5.0 (0.0) 61.2 (21.0) 


0.29 (0.06) 0.93 (0.56) 


0.48 (0.23) 


5.0 (0.0) 


21.9 (12.3) 


0.54 (0.16) 


MCP 


1.82 (0.36) 


0.80 (0.20) 


5.0 (0.0) 33.7 (15.0) 


0.41 (0.09) 0.84 (0.63) 


0.49 (0.27) 


5.0 (0.0) 


14.1 (8.9) 


0.66 (0.16) 


Oracle 


0.76 (0.07) 


0.55 (0.10) 


5 (0) 5 (0) 
Model 6: (n,p, q) = 


1 (0) 0.29 (0.11) 
(500,1000,1000), (a, b) = 


0.26 (0.11) 
: (0.5,0.75) 


5(0) 


5(0) 


1(0) 


Lasso 


2.65 (0.42) 


0.86 (0.16) 


5.0 (0.0) 86.3 (27.1) 


0.24 (0.04) 2.06 (1.32) 


0.77 (0.29) 


5.0 (0.0) 


27.6 (15.1) 


0.47 (0.14) 


SCAD 


2.39 (0.26) 


0.90 (0.17) 


5.0 (0.0) 47.2 (19.0) 


0.34 (0.07) 1.79 (0.80) 


0.70 (0.23) 


5.0 (0.0) 


28.9 (13.7) 


0.46 (0.15) 


MCP 


2.26 (0.27) 


0.89 (0.17) 


5.0 (0.0) 28.3 (13.1) 


0.45 (0.11) 1.52 (1.02) 


0.68 (0.31) 


5.0 (0.2) 


16.5 (10.5) 


0.61 (0.16) 


Oracle 


1.00 (0.07) 


0.61 (0.09) 


5 (0) 5 (0) 


1 (0) 0.40 (0.15) 


0.30 (0.12) 


5(0) 


5(0) 


1(0) 



PLS 



2SR 





Lasso 

SCAD 

MCP 

Oracle 
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Figure 3: Performance curves for different methods with the dimensions p 
and the sample size n varying from 200 to 1500. 
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Figure 4: Stability paths for different methods applied to the mouse obesity data based on 
100 subsamples. Genes with maximum selection probability at least 0.4 are displayed in 
solid lines, among which genes common to both the PLS and 2SR methods are shown in red 
and the distinct ones in blue, and the remaining genes are displayed in dashed lines. 



34 



Table 3: Genes chosen by stability selection with maximum selection probability (values 
shown) at least 0.5 and czs-SNPs (SNPs within 10 cM of each gene) identified by applying 
the 2SR method with different penalties to the mouse obesity data. Asterisks indicate genes 
that overlap those selected by the PLS method. 



Gene 



Lasso 



SCAD 



MCP 



czs-SNPs 



Igfbp2* 


0.56 


0.69 


Lamcl 




0.70 


Sirpa 


0.51 


0.55 


Gstm2* 


0.88 


0.91 


Ccnl2 


0.50 




Glccil 


0.56 




Vwf* 


0.72 


0.79 


Irx3 


0.62 




Apoa4 


0.65 




Socs2 






Avprla 


0.78 




Abca8a 


0.50 




Gpldl 


0.50 




Faml05a 




0.60 


Dscam 




0.60 


Slc22a3 


0.90 




2010002N04Rik 


0.73 


0.65 



0.89 



0.50 



0.68 



rs4137196, rs3722983, rs4231406 



rs4136518 
rs3720634 



rs3661189, rs3655324 

rs3663003 

rs3694833 
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